home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
ftp.cs.arizona.edu
/
ftp.cs.arizona.edu.tar
/
ftp.cs.arizona.edu
/
icon
/
newsgrp
/
group01b.txt
/
000188_icon-group-sender_Thu Dec 6 08:27:25 2001.msg
< prev
next >
Wrap
Internet Message Format
|
2002-01-03
|
4KB
Return-Path: <icon-group-sender>
Received: (from root@localhost)
by baskerville.CS.Arizona.EDU (8.11.1/8.11.1) id fB6FQMU24132
for icon-group-addresses; Thu, 6 Dec 2001 08:26:22 -0700 (MST)
Message-Id: <200112061526.fB6FQMU24132@baskerville.CS.Arizona.EDU>
Date: Thu, 6 Dec 2001 01:18:29 -0600 (CST)
From: John Paolillo <johnp@ling.uta.edu>
To: eka@corp.cirrus.com, trutkin@physics.clarku.edu
Subject: Re: Bio Informatics
Cc: icon-group@cs.arizona.edu
Errors-To: icon-group-errors@cs.arizona.edu
Status: RO
Content-Length: 3651
In formal-language theoretic terms, regular expressions
allow the definition of finite-state transducers (FST's). These
can be thought of as being described by ordinary finite
state automata (FSA) with one difference, namely that they have
two tapes, an input and an output tape, and state transitions
can specify read-actions on the input tape and write
actions on the output tape. Some backtracking is normally
allowed. All the same, we're pretty far down the Chomsky
hierarchy in terms of computational power, with a little bit
more than the ability to recognize regular languages (the
term used in the Computational Linguistics literature for
what FST's compute is "regular relation").
Icon string scanning is indeed more powerful. It is very
easy to write parsers that are context-free in their
computational power, using no more than the implicit
arguments (&subject, &pos, etc.) of the scanning
environment. When explicit argument-passing is allowed,
it becomes possible to write scanning procedures that are
context-sensitive and Turing-equivalent in their
computational power.
So the relationship between Icon scanning and Perl regexp
is that Icon is the more powerful and expressive of the two.
This is both a "good" thing and a "bad" thing: scanning
expressions that use the context-free or context sensitive
expressivity of Icon's scanning expressions may have a
worst-case exponential (i.e. intractable) complexity. FST-
based scanning is generally more tractable. If built-in
regexps are used for scanning, you are basically forced
to adopt a strategy from a tractable set of computations.
Not a bad thing if you are worried about performance, but
frustrating if you need the missing expressivity for
something.
In computational linguistics, there is renewed interest in
FSA and FST-based methods, on account of the tractability
issues, even though it is generally recognized that something
more (what A. Joshi called "weak context sensitivity") is
probably needed to adequately describe human languages.
For Bio-informatics, where the main tasks are matching and
assembling long gene sequences, where we don't expect to
find troubling context-sensitive dependencies (if I have
such-and-such a thousand-base sequence, then I should expect
an approximate copy of such-and-such another sequence over
there...), FSA and FST methods are probably enough. (I doubt
that anyone would have tried to use context free methods even,
though I could well be wrong.) And since Perl comes plonked
into just about any Linux or Unix installation these days,
you can get going on your sequencing just by downloading a
few scripts from someone's web or ftp site. I think that
would be the main reason for people using Perl (rather than,
say, Java...)
Of course, one can easily write procedures for FSA and FST
using Icon's string scanning (I suppose this is how the
regexp library routines work?). This would be my preference,
for purely idiosyncratic reasons. I just think that the
design of the scanning environment in Icon is an elegant
and brilliant piece of work. The use of the implicit
variables is what I would consider a revolutionary concept
in programming language design. Once you understand it,
you almost never have to interact with it; it just seamlessly
does its work for you while you attend to what is interesting
in your program. How unlike Prolog DCG's (which I use a lot
now) or even regexps. Besides, you never know when you'll
want just a little bit more expressivity...
Anyway, I hope this helps,
John Paolillo
SLIS and Informatics
Indiana University
http://ella.slis.indiana.edu/~paolillo/